import sklearnfrom sklearn import datasetsimport numpy as npimport pandas as pdimport seaborn as snsimport matplotlib.pyplot as pltimport randomfrom collections import Counterfrom sklearn.metrics import accuracy_scorefrom sklearn.metrics import precision_recall_fscore_support
Introductions
Decision Trees:
A Decision Tree is a tree-like structure where each internal node represents a feature, each branch represents a decision rule, and each leaf node represents the tree outcome. The top node in a tree is known as the root node. It learns to partition on the basis of the attribute value. It partitions the tree in a recursive manner called recursive partitioning. This tree-like structure helps in decision making.
Random Forest:
Random Forest is an ensemble learning method, where multiple decision trees are combined to increase the overall performance. While a single decision tree might lead to overfitting, Random Forest averages multiple trees to reduce overfitting and improve prediction accuracy. Furthermore, when splitting each node during the construction of a tree, the best split is found either from all input features or a random subset of size max_features.
Methods
For this part of the analysis, I will be applying both classification and regression for decision tree analysis, I will be applying the classification methods to binary data Cardiovascular dieases (CVD) and regressino methods to continuous numeric data GLUCOSE.
For this part of the analysis, I will be using the Framingham Heart Study dataset@RProjectFramingham for both Decision Tree and Random Forest Analysis, the target variable for this analysis would be "CVD", which is Cardio Vascular diseases.
Code
# Load the Frammingham Heart Study data setdata = pd.read_csv("data/frmgham2.csv")data.head()# Split target variables and predictor variablesy = data['CVD']x = data.drop('CVD', axis=1)x = x.iloc[:, 1:]x = x.valuesy = y.values# Calculating class distribution for 'DIABETES'cd = data['CVD'].value_counts(normalize=True) *100print(cd)
The results show binary data, about roughly 75% of data goes to 0 which is negative test for CVD, and about 25% of data goes to 1 which is positive test for CVD. We have a much higher percentages of data on negative tests, this could lead to a biased model favoring the majority class, we need a more detailed evaluation part to address this issue.
corr = data.corr()sns.set_theme(style="white")f, ax = plt.subplots(figsize=(11, 9)) # Set up the matplotlib figurecmap = sns.diverging_palette(230, 20, as_cmap=True) # Generate a custom diverging colormap# Draw the heatmap with the mask and correct aspect ratiosns.heatmap(corr, cmap=cmap, vmin=-1, vmax=1, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .5})plt.show();
Baseline model for comparison
Code
## RANDOM CLASSIFIER def random_classifier(y_data): ypred=[]; max_label=np.max(y_data);#print(max_label)for i inrange(0,len(y_data)): ypred.append(int(np.floor((max_label+1)*np.random.uniform(0,1))))print("-----RANDOM CLASSIFIER-----")print("count of prediction:",Counter(ypred).values()) # counts the elements' frequencyprint("probability of prediction:",np.fromiter(Counter(ypred).values(), dtype=float)/len(y_data)) # counts the elements' frequencyprint("accuracy",accuracy_score(y_data, ypred))print("percision, recall, fscore,",precision_recall_fscore_support(y_data, ypred))np.random.seed(42) random.seed(42)random_classifier(y)
-----RANDOM CLASSIFIER-----
count of prediction: dict_values([5874, 5753])
probability of prediction: [0.50520341 0.49479659]
accuracy 0.5015051173991572
percision, recall, fscore, (array([0.7495744 , 0.24821832]), array([0.50446838, 0.49258365]), array([0.60306807, 0.33009709]), array([8728, 2899], dtype=int64))
Interpretation
Count of Prediction: It shows the number of times each class label (0 and 1) was predicted. The values [5776, 5851] indicate that out of all predictions, 5874 were classified as class ‘0’ and 5851 as class ‘1’.
Probability of Prediction: This represents the proportion of each class label in the predictions. The values [0.50520341 0.49479659] suggest that approximately 50% of predictions are class ‘0’ which is negative, and 50% are class ‘1’, whcih is positive. This near-even split indicates that the classifier is randomly assigning class labels with almost equal probability to each class.
Accuracy: An accuracy of approximately 50% is observed. In the context of a binary classifier, an accuracy close to 50% suggests that the classifier’s performance is only slightly better than random guessing. This is expected, given the nature of the random classifier.
Precision: Precision for each class (0.7495744 for class ‘0’, 0.24821832 for class ‘1’) indicates the proportion of correctly predicted positive observations out of all predictions in that class. Higher precision for class ‘0’ suggests that the classifier is more reliable when it predicts an instance as class ‘0’ compared to class ‘1’.
Recall: Recall (0.49702108 for class ‘0’, 0.50446838 for class ‘1’) is the proportion of actual positives correctly identified. The values are close to 50%, showing the classifier’s limited capability in correctly identifying true cases for each class.
F-Score: F-Score is the harmonic mean of precision and recall. The values (0.60306807 for class ‘0’, 0.33009709 for class ‘1’) indicate the balance between precision and recall for each class, with class ‘0’ having a better balance compared to class ‘1’.
Model Tuning
Partion data to train/test split
Code
from sklearn.model_selection import train_test_splittest_ratio=0.2x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_ratio, random_state=0)y_train=y_train.flatten()y_test=y_test.flatten()print("x_train.shape :",x_train.shape)print("y_train.shape :",y_train.shape)print("X_test.shape :",x_test.shape)print("y_test.shape :",y_test.shape)
Hyper-Parameter tuning for Decision Tree Classification ((max_depth))
Code
from sklearn.metrics import mean_absolute_percentage_errorfrom sklearn.metrics import mean_absolute_errorfrom sklearn.tree import DecisionTreeRegressorfrom sklearn.tree import DecisionTreeClassifier# HYPER PARAMETER SEARCH FOR OPTIMAL NUMBER OF NEIGHBORS hyper_param=[]train_error=[]test_error=[]# LOOP OVER HYPER-PARAMfor i inrange(1,40):# INITIALIZE MODEL model = DecisionTreeClassifier(max_depth=i)# TRAIN MODEL model.fit(x_train,y_train)# OUTPUT PREDICTIONS FOR TRAINING AND TEST SET yp_train = model.predict(x_train) yp_test = model.predict(x_test)# shift=1+np.min(y_train) #add shift to remove division by zero err1=mean_absolute_error(y_train, yp_train) err2=mean_absolute_error(y_test, yp_test) # err1=100.0*np.mean(np.absolute((yp_train-y_train)/y_train))# err2=100.0*np.mean(np.absolute((yp_test-y_test)/y_test)) hyper_param.append(i) train_error.append(err1) test_error.append(err2)if(i==1or i%10==0):print("hyperparam =",i)print(" train error:",err1)print(" test error:" ,err2)
hyperparam = 1
train error: 0.15138157187399204
test error: 0.15047291487532244
hyperparam = 10
train error: 0.0
test error: 0.0030094582975064487
hyperparam = 20
train error: 0.0
test error: 0.0021496130696474634
hyperparam = 30
train error: 0.0
test error: 0.0030094582975064487
For max_depth = 1, both the training and test errors are relatively high, with values around 0.15. This suggests that the decision tree is underfitting the data, the depth of tree needs to higer for better model fitting.
As the max_depth increase to 10 and beyond, both the training and test errors decrease significantly. When max_depth = 10, the training error is very low (close to 0), indicating that the model is fitting the training data extremely well. However, the test error remains relatively low, suggesting that the model is generalizing reasonably well to unseen data.
Interestingly, as max_depth furthur increase to 20 and 30, the training error remains very low (close to 0), but the test error starts to increase slightly. This is a sign of overfitting. The model is capturing the data too well, we don’t want this to happen for generalization purposes.
Overall, the results show that a max_depth of around 10 appears to be a good choice because it achieves low test error without overfitting the training data.
Convergence plot
Code
plt.plot(hyper_param,train_error ,linewidth=2, color='k')plt.plot(hyper_param,test_error ,linewidth=2, color='b')plt.xlabel("Depth of tree (max depth)")plt.ylabel("Training (black) and test (blue) MAE (error)")i=1print(hyper_param[i],train_error[i],test_error[i])
2 0.07386302548113106 0.06835769561478934
The convergence plot clearly demonstrates that as the max_depth hyperparameter increases, both training and test errors demonstrate a pattern of stabilization. This observation alligns with the assumptions made in the previous section. Specifically, it indicates that beyond a certain depth, the decision tree model ceases to substantially improve its fit to the training data, and we could infer that increasing the max_depth further may lead to overfitting, as the model becomes overly complex and starts fitting noise in the training data, while failing to improve its generalization. Hence, the plot supports the notion that there exists an optimal max_depth value that strikes a balance between model complexity and performance on the test dataset.
Hyper-Parameter tuning for Decision Tree Classification (min_samples_splitint)
Code
# HYPER PARAMETER SEARCH FOR OPTIMAL NUMBER OF NEIGHBORS hyper_param=[]train_error=[]test_error=[]# LOOP OVER HYPER-PARAMfor i inrange(2,100):# INITIALIZE MODEL model = DecisionTreeClassifier(min_samples_split=i)# TRAIN MODEL model.fit(x_train,y_train)# OUTPUT PREDICTIONS FOR TRAINING AND TEST SET yp_train = model.predict(x_train) yp_test = model.predict(x_test)# shift=1+np.min(y_train) #add shift to remove division by zero err1=mean_absolute_error(y_train, yp_train) err2=mean_absolute_error(y_test, yp_test) # err1=100.0*np.mean(np.absolute((yp_train-y_train)/y_train))# err2=100.0*np.mean(np.absolute((yp_test-y_test)/y_test)) hyper_param.append(i) train_error.append(err1) test_error.append(err2)if(i%10==0):print("hyperparam =",i)print(" train error:",err1)print(" test error:" ,err2)
hyperparam = 10
train error: 0.0005375766046661649
test error: 0.003869303525365434
hyperparam = 20
train error: 0.0006450919255993979
test error: 0.0030094582975064487
hyperparam = 30
train error: 0.0008601225674658639
test error: 0.0034393809114359416
hyperparam = 40
train error: 0.0016127298139984947
test error: 0.004299226139294927
hyperparam = 50
train error: 0.0016127298139984947
test error: 0.004299226139294927
hyperparam = 60
train error: 0.0016127298139984947
test error: 0.004299226139294927
hyperparam = 70
train error: 0.0017202451349317277
test error: 0.004299226139294927
hyperparam = 80
train error: 0.0017202451349317277
test error: 0.004299226139294927
hyperparam = 90
train error: 0.0017202451349317277
test error: 0.004299226139294927
The min_samples_split parameter controls the minimum number of samples required to split an internal node during the construction of the decision tree.
As min_samples_split increases from 10 to 20, both the training and test errors generally decrease. This suggests that when internal nodes require a larger number of samples to split.
Beyond min_samples_split = 20, the training error remains low or slightly increases, indicating that the model still fits the training data well, however, the erros for test data follows a clear wave-like pattern.
Overall, the results suggest that a min_samples_split value around 30 provides a good balance between model complexity and generalization.
Convergence Plot
Code
plt.plot(hyper_param,train_error ,linewidth=2, color='k')plt.plot(hyper_param,test_error ,linewidth=2, color='b')plt.xlabel("Minimum number of points in split (min_samples_split)")plt.ylabel("Training (black) and test (blue) MAE (error)")
Text(0, 0.5, 'Training (black) and test (blue) MAE (error)')
The convergence plot highlights a critical observation: there exists a specific range of min_samples_split values where the model performs well on both the training and test datasets. However, as min_samples_split goes beyond this range, the model’s performance on the test data deteriorates noticeably.
In practical terms, this indicates that we should seek a min_samples_split value that does not result in a significant drop in test data performance. The objective is to find the sweet spot where the model maintains good generalization to unseen data without overfitting or underfitting.
Re-train with optimal parameters
Code
# INITIALIZE MODEL model = DecisionTreeClassifier(max_depth=10, min_samples_split=30)model.fit(x_train,y_train) # TRAIN MODEL # OUTPUT PREDICTIONS FOR TRAINING AND TEST SET yp_train = model.predict(x_train)yp_test = model.predict(x_test)err1=mean_absolute_error(y_train, yp_train) err2=mean_absolute_error(y_test, yp_test) print(" train error:",err1)print(" test error:" ,err2)
train error: 0.0008601225674658639
test error: 0.0034393809114359416
Parity Plot
Plotting y_pred vs y_data lets you see how good the fit is
The closer to the line y=x the better the fit (ypred=ydata –> prefect fit)
Training Data Fit: The black dots are closer to the red line, indicating a good fit for the training data.
Test Data Fit: The blue dots are spread out, indicating more variance in the fit for the test data compared to the training data.
Overall Performance: Since some blue points are far from the red line, it suggests the model may not generalize well to new, unseen data, possibly overfitting to the training data.
Conclusions(The final reaults are discussed with each section)
The decision tree with classification analysis gave us insights into heart disease risks, which performed quite well when looking back at the data we trained it on. However, when faced with new, unseen data, the model’s predictions were inconsistent. This suggests that while our model has learned well from past data, it needs to be better tuned to handle new information effectively.
To improve, we aim to adjust our model’s complexity. When the model is too simple, and we miss important patterns, when the model is too complex, and we face the risk if overfitting instead of generalization. The goal is to build a model that not only learns from the past but is also ptovide usefull insight for generalization.
Data understanding
For this part of the analysis, I will be using the Framingham Heart Study dataset@RProjectFramingham for both Decision Tree and Random Forest Analysis, the target variable for this analysis would be GLUCOSE, in mmol/L (millimoles per liter).
Code
from sklearn.metrics import mean_absolute_percentage_errorfrom sklearn.metrics import mean_absolute_errorfrom sklearn.tree import DecisionTreeRegressor# Load the Frammingham Heart Study data setdata = pd.read_csv("data/frmgham2.csv")print(data.head())data.dropna(inplace=True)# Split target variables and predictor variablesy = data['GLUCOSE']x = data.drop('GLUCOSE', axis=1)x = x.iloc[:, 1:]x = x.valuesy = y.values# Normalize x=0.1+(x-np.min(x,axis=0))/(np.max(x,axis=0)-np.min(x,axis=0))y=0.1+(y-np.min(y,axis=0))/(np.max(y,axis=0)-np.min(y,axis=0))
Hyper-Parameter tuning for Decision Tree Classification ((max_depth))
Code
from sklearn.metrics import mean_absolute_percentage_errorfrom sklearn.metrics import mean_absolute_errorfrom sklearn.tree import DecisionTreeRegressorfrom sklearn.tree import DecisionTreeClassifier# HYPER PARAMETER SEARCH FOR OPTIMAL NUMBER OF NEIGHBORS hyper_param=[]train_error=[]test_error=[]# LOOP OVER HYPER-PARAMfor i inrange(1,40):# INITIALIZE MODEL model = DecisionTreeRegressor(max_depth=i)# TRAIN MODEL model.fit(x_train,y_train)# OUTPUT PREDICTIONS FOR TRAINING AND TEST SET yp_train = model.predict(x_train) yp_test = model.predict(x_test)# shift=1+np.min(y_train) #add shift to remove division by zero err1=mean_absolute_error(y_train, yp_train) err2=mean_absolute_error(y_test, yp_test) # err1=100.0*np.mean(np.absolute((yp_train-y_train)/y_train))# err2=100.0*np.mean(np.absolute((yp_test-y_test)/y_test)) hyper_param.append(i) train_error.append(err1) test_error.append(err2)if(i==1or i%10==0):print("hyperparam =",i)print(" train error:",err1)print(" test error:" ,err2)
hyperparam = 1
train error: 0.035268787108589356
test error: 0.03418032867241032
hyperparam = 10
train error: 0.02038622946162721
test error: 0.04358912209657024
hyperparam = 20
train error: 0.005180126792851705
test error: 0.047937823543169375
hyperparam = 30
train error: 0.00022663570019033476
test error: 0.04882001990611059
For max_depth = 1, both the training and test errors are relatively low, with values around 0.03. This suggests that the decision tree is fitting the data well.
As the max_depth increase to 10 and beyond, the training erros starts to drop, however, the test error starts to go up.
Interestingly, as max_depth furthur increase to 20 and 30, the training error remains very low (close to 0), but the test error starts to increase slightly. This is a sign of overfitting. The model is capturing the data too well, we don’t want this to happen for generalization purposes.
Overall, the results show that a max_depth of around 1 to 10 appears to be a good choice because it achieves low test error without overfitting the training data.
Convergence plot
Code
plt.plot(hyper_param,train_error ,linewidth=2, color='k')plt.plot(hyper_param,test_error ,linewidth=2, color='b')plt.xlabel("Depth of tree (max depth)")plt.ylabel("Training (black) and test (blue) MAE (error)")i=1print(hyper_param[i],train_error[i],test_error[i])
2 0.03442784374001432 0.032813904712789006
The convergence plot clearly demonstrates that as the max_depth hyperparameter increases, The point at which the lines start to diverge significantly is critical; it marks the transition from underfitting to the optimal complexity and then to overfitting. According to the plot a max_depth of 0 to 5 is optimal for a balance between training error and testing error.
Hyper-Parameter tuning for Decision Tree Regression (min_samples_splitint)
Code
# HYPER PARAMETER SEARCH FOR OPTIMAL NUMBER OF NEIGHBORS hyper_param=[]train_error=[]test_error=[]# LOOP OVER HYPER-PARAMfor i inrange(2,100):# INITIALIZE MODEL model = DecisionTreeRegressor(min_samples_split=i)# TRAIN MODEL model.fit(x_train,y_train)# OUTPUT PREDICTIONS FOR TRAINING AND TEST SET yp_train = model.predict(x_train) yp_test = model.predict(x_test)# shift=1+np.min(y_train) #add shift to remove division by zero err1=mean_absolute_error(y_train, yp_train) err2=mean_absolute_error(y_test, yp_test) # err1=100.0*np.mean(np.absolute((yp_train-y_train)/y_train))# err2=100.0*np.mean(np.absolute((yp_test-y_test)/y_test)) hyper_param.append(i) train_error.append(err1) test_error.append(err2)if(i%10==0):print("hyperparam =",i)print(" train error:",err1)print(" test error:" ,err2)
hyperparam = 10
train error: 0.014984530706046054
test error: 0.045871667532753846
hyperparam = 20
train error: 0.021962415929112074
test error: 0.043417885579028495
hyperparam = 30
train error: 0.026579851814848933
test error: 0.040156857780947484
hyperparam = 40
train error: 0.0279409667047845
test error: 0.038751571864489465
hyperparam = 50
train error: 0.028391822245632067
test error: 0.038300992283321324
hyperparam = 60
train error: 0.029240913243283456
test error: 0.036936685875858556
hyperparam = 70
train error: 0.029627133194660586
test error: 0.036694713507271985
hyperparam = 80
train error: 0.03032461525639734
test error: 0.03656614931125601
hyperparam = 90
train error: 0.03058836291359766
test error: 0.036432623399717076
The min_samples_split parameter controls the minimum number of samples required to split an internal node during the construction of the decision tree.
As min_samples_split increases from 10 to 20, both the training and test errors generally decrease. This suggests that when internal nodes require a larger number of samples to split.
Beyond min_samples_split = 20, the training error remains low, indicating that the model still fits the training data well, the test error drops at the same time.
Convergence Plot
Code
plt.plot(hyper_param,train_error ,linewidth=2, color='k')plt.plot(hyper_param,test_error ,linewidth=2, color='b')plt.xlabel("Minimum number of points in split (min_samples_split)")plt.ylabel("Training (black) and test (blue) MAE (error)")
Text(0, 0.5, 'Training (black) and test (blue) MAE (error)')
The convergence plot demonstrates that test error and train error converge to a point, roughly min_samples_spliy = 50.
Re-train with optimal parameters
Code
# INITIALIZE MODEL model = DecisionTreeRegressor(max_depth=1,min_samples_split=50)model.fit(x_train,y_train) # TRAIN MODEL # OUTPUT PREDICTIONS FOR TRAINING AND TEST SET yp_train = model.predict(x_train)yp_test = model.predict(x_test)err1=mean_absolute_error(y_train, yp_train) err2=mean_absolute_error(y_test, yp_test) print(" train error:",err1)print(" test error:" ,err2)
train error: 0.03526878710858939
test error: 0.03418032867241034
Parity Plot
Plotting y_pred vs y_data lets you see how good the fit is
The closer to the line y=x the better the fit (ypred=ydata –> prefect fit)
Conclusions(The final reaults are discussed with each section)
The decision tree with regression analysis gave us insights into heart disease risks, which performed quite well when looking back at the data we trained it on. However, when faced with new, unseen data, the model’s predictions were inconsistent. This suggests that while our model has learned well from past data, it needs to be better tuned to handle new information effectively.
To improve, we aim to adjust our model’s complexity. When the model is too simple, and we miss important patterns, when the model is too complex, and we face the risk if overfitting instead of generalization. The goal is to build a model that not only learns from the past but is also ptovide usefull insight for generalization.
Source Code
---title: "Decision Tree"format: html: page-layout: full code-fold: true code-copy: true code-tools: true code-overflow: wrap embed-resources: true---```{python}import sklearnfrom sklearn import datasetsimport numpy as npimport pandas as pdimport seaborn as snsimport matplotlib.pyplot as pltimport randomfrom collections import Counterfrom sklearn.metrics import accuracy_scorefrom sklearn.metrics import precision_recall_fscore_support```## Introductions### Decision Trees:A Decision Tree is a tree-like structure where each internal node represents a feature, each branch represents a decision rule, and each leaf node represents the tree outcome. The top node in a tree is known as the root node. It learns to partition on the basis of the attribute value. It partitions the tree in a recursive manner called recursive partitioning. This tree-like structure helps in decision making.### Random Forest:Random Forest is an ensemble learning method, where multiple decision trees are combined to increase the overall performance. While a single decision tree might lead to overfitting, Random Forest averages multiple trees to reduce overfitting and improve prediction accuracy. Furthermore, when splitting each node during the construction of a tree, the best split is found either from all input features or a random subset of size max_features.## MethodsFor this part of the analysis, I will be applying both `classification` and `regression` for decision tree analysis, I will be applying the classification methods to binary data Cardiovascular dieases `(CVD)` and regressino methods to continuous numeric data `GLUCOSE`.::: panel-tabset# Classification## Data and Class DistributionFor this part of the analysis, I will be using the Framingham Heart Study dataset\@RProjectFramingham for both Decision Tree and Random Forest Analysis, the target variable for this analysis would be `"CVD"`, which is Cardio Vascular diseases.```{python}# Load the Frammingham Heart Study data setdata = pd.read_csv("data/frmgham2.csv")data.head()# Split target variables and predictor variablesy = data['CVD']x = data.drop('CVD', axis=1)x = x.iloc[:, 1:]x = x.valuesy = y.values# Calculating class distribution for 'DIABETES'cd = data['CVD'].value_counts(normalize=True) *100print(cd)```The results show binary data, about roughly `75%` of data goes to 0 which is negative test for CVD, and about 25% of data goes to 1 which is positive test for CVD. We have a much higher percentages of data on negative tests, this could lead to a biased model favoring the majority class, we need a more detailed evaluation part to address this issue.## EDA```{python}data_eda = data.iloc[:, 1:]print(data_eda.describe())```### Correlation Matrix Heatmap```{python}corr = data.corr()sns.set_theme(style="white")f, ax = plt.subplots(figsize=(11, 9)) # Set up the matplotlib figurecmap = sns.diverging_palette(230, 20, as_cmap=True) # Generate a custom diverging colormap# Draw the heatmap with the mask and correct aspect ratiosns.heatmap(corr, cmap=cmap, vmin=-1, vmax=1, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .5})plt.show();```## Baseline model for comparison```{python}## RANDOM CLASSIFIER def random_classifier(y_data): ypred=[]; max_label=np.max(y_data);#print(max_label)for i inrange(0,len(y_data)): ypred.append(int(np.floor((max_label+1)*np.random.uniform(0,1))))print("-----RANDOM CLASSIFIER-----")print("count of prediction:",Counter(ypred).values()) # counts the elements' frequencyprint("probability of prediction:",np.fromiter(Counter(ypred).values(), dtype=float)/len(y_data)) # counts the elements' frequencyprint("accuracy",accuracy_score(y_data, ypred))print("percision, recall, fscore,",precision_recall_fscore_support(y_data, ypred))np.random.seed(42) random.seed(42)random_classifier(y)```- **Interpretation****Count of Prediction:** It shows the number of times each class label (0 and 1) was predicted. The values \[5776, 5851\] indicate that out of all predictions, `5874` were classified as class '0' and `5851` as class '1'.**Probability of Prediction:** This represents the proportion of each class label in the predictions. The values `[0.50520341 0.49479659]` suggest that approximately `50%` of predictions are class '0' which is negative, and `50%` are class '1', whcih is positive. This near-even split indicates that the classifier is randomly assigning class labels with almost equal probability to each class.**Accuracy:** An accuracy of approximately `50%` is observed. In the context of a binary classifier, an accuracy close to 50% suggests that the classifier's performance is only slightly better than random guessing. This is expected, given the nature of the random classifier.**Precision:** Precision for each class (`0.7495744` for class '0', `0.24821832` for class '1') indicates the proportion of correctly predicted positive observations out of all predictions in that class. Higher precision for class '0' suggests that the classifier is more reliable when it predicts an instance as class '0' compared to class '1'.**Recall:** Recall (0.49702108 for class '0', `0.50446838` for class '1') is the proportion of actual positives correctly identified. The values are close to `50%`, showing the classifier's limited capability in correctly identifying true cases for each class.**F-Score:** F-Score is the harmonic mean of precision and recall. The values (`0.60306807` for class '0', `0.33009709` for class '1') indicate the balance between precision and recall for each class, with class '0' having a better balance compared to class '1'.## Model Tuning### Partion data to train/test split```{python}from sklearn.model_selection import train_test_splittest_ratio=0.2x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_ratio, random_state=0)y_train=y_train.flatten()y_test=y_test.flatten()print("x_train.shape :",x_train.shape)print("y_train.shape :",y_train.shape)print("X_test.shape :",x_test.shape)print("y_test.shape :",y_test.shape)```### Hyper-Parameter tuning for Decision Tree Classification ((max_depth))```{python}from sklearn.metrics import mean_absolute_percentage_errorfrom sklearn.metrics import mean_absolute_errorfrom sklearn.tree import DecisionTreeRegressorfrom sklearn.tree import DecisionTreeClassifier# HYPER PARAMETER SEARCH FOR OPTIMAL NUMBER OF NEIGHBORS hyper_param=[]train_error=[]test_error=[]# LOOP OVER HYPER-PARAMfor i inrange(1,40):# INITIALIZE MODEL model = DecisionTreeClassifier(max_depth=i)# TRAIN MODEL model.fit(x_train,y_train)# OUTPUT PREDICTIONS FOR TRAINING AND TEST SET yp_train = model.predict(x_train) yp_test = model.predict(x_test)# shift=1+np.min(y_train) #add shift to remove division by zero err1=mean_absolute_error(y_train, yp_train) err2=mean_absolute_error(y_test, yp_test) # err1=100.0*np.mean(np.absolute((yp_train-y_train)/y_train))# err2=100.0*np.mean(np.absolute((yp_test-y_test)/y_test)) hyper_param.append(i) train_error.append(err1) test_error.append(err2)if(i==1or i%10==0):print("hyperparam =",i)print(" train error:",err1)print(" test error:" ,err2)```- For `max_depth` = 1, both the training and test errors are relatively high, with values around 0.15. This suggests that the decision tree is underfitting the data, the depth of tree needs to higer for better model fitting.- As the `max_depth` increase to 10 and beyond, both the training and test errors decrease significantly. When max_depth = 10, the training error is very low (close to 0), indicating that the model is fitting the training data extremely well. However, the test error remains relatively low, suggesting that the model is generalizing reasonably well to unseen data.- Interestingly, as `max_depth` furthur increase to 20 and 30, the training error remains very low (close to 0), but the test error starts to increase slightly. This is a sign of overfitting. The model is capturing the data too well, we don't want this to happen for generalization purposes.- Overall, the results show that a `max_depth` of around 10 appears to be a good choice because it achieves low test error without overfitting the training data.- **Convergence plot**```{python}plt.plot(hyper_param,train_error ,linewidth=2, color='k')plt.plot(hyper_param,test_error ,linewidth=2, color='b')plt.xlabel("Depth of tree (max depth)")plt.ylabel("Training (black) and test (blue) MAE (error)")i=1print(hyper_param[i],train_error[i],test_error[i])```- The convergence plot clearly demonstrates that as the `max_depth` hyperparameter increases, both training and test errors demonstrate a pattern of stabilization. This observation alligns with the assumptions made in the previous section. Specifically, it indicates that beyond a certain depth, the decision tree model ceases to substantially improve its fit to the training data, and we could infer that increasing the `max_depth` further may lead to overfitting, as the model becomes overly complex and starts fitting noise in the training data, while failing to improve its generalization. Hence, the plot supports the notion that there exists an optimal `max_depth` value that strikes a balance between model complexity and performance on the test dataset.### Hyper-Parameter tuning for Decision Tree Classification (min_samples_splitint)```{python}# HYPER PARAMETER SEARCH FOR OPTIMAL NUMBER OF NEIGHBORS hyper_param=[]train_error=[]test_error=[]# LOOP OVER HYPER-PARAMfor i inrange(2,100):# INITIALIZE MODEL model = DecisionTreeClassifier(min_samples_split=i)# TRAIN MODEL model.fit(x_train,y_train)# OUTPUT PREDICTIONS FOR TRAINING AND TEST SET yp_train = model.predict(x_train) yp_test = model.predict(x_test)# shift=1+np.min(y_train) #add shift to remove division by zero err1=mean_absolute_error(y_train, yp_train) err2=mean_absolute_error(y_test, yp_test) # err1=100.0*np.mean(np.absolute((yp_train-y_train)/y_train))# err2=100.0*np.mean(np.absolute((yp_test-y_test)/y_test)) hyper_param.append(i) train_error.append(err1) test_error.append(err2)if(i%10==0):print("hyperparam =",i)print(" train error:",err1)print(" test error:" ,err2)```- The `min_samples_split` parameter controls the minimum number of samples required to split an internal node during the construction of the decision tree.- As `min_samples_split` increases from 10 to 20, both the training and test errors generally decrease. This suggests that when internal nodes require a larger number of samples to split.- Beyond `min_samples_split` = 20, the training error remains low or slightly increases, indicating that the model still fits the training data well, however, the erros for test data follows a clear wave-like pattern.- Overall, the results suggest that a `min_samples_split` value around 30 provides a good balance between model complexity and generalization.**Convergence Plot**```{python}plt.plot(hyper_param,train_error ,linewidth=2, color='k')plt.plot(hyper_param,test_error ,linewidth=2, color='b')plt.xlabel("Minimum number of points in split (min_samples_split)")plt.ylabel("Training (black) and test (blue) MAE (error)")```The convergence plot highlights a critical observation: there exists a specific range of `min_samples_split` values where the model performs well on both the training and test datasets. However, as `min_samples_split` goes beyond this range, the model's performance on the test data deteriorates noticeably.In practical terms, this indicates that we should seek a `min_samples_split` value that does not result in a significant drop in test data performance. The objective is to find the sweet spot where the model maintains good generalization to unseen data without overfitting or underfitting.### Re-train with optimal parameters```{python}# INITIALIZE MODEL model = DecisionTreeClassifier(max_depth=10, min_samples_split=30)model.fit(x_train,y_train) # TRAIN MODEL # OUTPUT PREDICTIONS FOR TRAINING AND TEST SET yp_train = model.predict(x_train)yp_test = model.predict(x_test)err1=mean_absolute_error(y_train, yp_train) err2=mean_absolute_error(y_test, yp_test) print(" train error:",err1)print(" test error:" ,err2)```### Parity Plot- Plotting y_pred vs y_data lets you see how good the fit is- The closer to the line y=x the better the fit (ypred=ydata –\> prefect fit)```{python}plt.plot(y_train,yp_train ,"o", color='k')plt.plot(y_test,yp_test ,"o", color='b')plt.plot(y_train,y_train ,"-", color='r')plt.xlabel("y_data")plt.ylabel("y_pred (blue=test)(black=Train)")```- **Training Data Fit:** The black dots are closer to the red line, indicating a good fit for the training data.- **Test Data Fit:** The blue dots are spread out, indicating more variance in the fit for the test data compared to the training data.- **Overall Performance:** Since some blue points are far from the red line, it suggests the model may not generalize well to new, unseen data, possibly overfitting to the training data.### Plot Tree```{python}from sklearn import treedef plot_tree(model): fig = plt.figure(figsize=(25,20)) _ = tree.plot_tree(model, filled=True) plt.show()plot_tree(model)```## Conclusions(The final reaults are discussed with each section)- The decision tree with classification analysis gave us insights into heart disease risks, which performed quite well when looking back at the data we trained it on. However, when faced with new, unseen data, the model's predictions were inconsistent. This suggests that while our model has learned well from past data, it needs to be better tuned to handle new information effectively.- To improve, we aim to adjust our model's complexity. When the model is too simple, and we miss important patterns, when the model is too complex, and we face the risk if overfitting instead of generalization. The goal is to build a model that not only learns from the past but is also ptovide usefull insight for generalization.# Regression## Data understandingFor this part of the analysis, I will be using the Framingham Heart Study dataset\@RProjectFramingham for both Decision Tree and Random Forest Analysis, the target variable for this analysis would be `GLUCOSE`, in mmol/L (millimoles per liter).```{python}from sklearn.metrics import mean_absolute_percentage_errorfrom sklearn.metrics import mean_absolute_errorfrom sklearn.tree import DecisionTreeRegressor# Load the Frammingham Heart Study data setdata = pd.read_csv("data/frmgham2.csv")print(data.head())data.dropna(inplace=True)# Split target variables and predictor variablesy = data['GLUCOSE']x = data.drop('GLUCOSE', axis=1)x = x.iloc[:, 1:]x = x.valuesy = y.values# Normalize x=0.1+(x-np.min(x,axis=0))/(np.max(x,axis=0)-np.min(x,axis=0))y=0.1+(y-np.min(y,axis=0))/(np.max(y,axis=0)-np.min(y,axis=0))```## EDA```{python}data_eda = data.iloc[:, 1:]print(data_eda.describe())```### Correlation Matrix Heatmap```{python}corr = data.corr()sns.set_theme(style="white")f, ax = plt.subplots(figsize=(11, 9)) # Set up the matplotlib figurecmap = sns.diverging_palette(230, 20, as_cmap=True) # Generate a custom diverging colormap# Draw the heatmap with the mask and correct aspect ratiosns.heatmap(corr, cmap=cmap, vmin=-1, vmax=1, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .5})plt.show();```## Model Tuning### Partion data to train/test split```{python}from sklearn.model_selection import train_test_splittest_ratio=0.2x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_ratio, random_state=0)y_train=y_train.flatten()y_test=y_test.flatten()print("x_train.shape :",x_train.shape)print("y_train.shape :",y_train.shape)print("X_test.shape :",x_test.shape)print("y_test.shape :",y_test.shape)```### Hyper-Parameter tuning for Decision Tree Classification ((max_depth))```{python}from sklearn.metrics import mean_absolute_percentage_errorfrom sklearn.metrics import mean_absolute_errorfrom sklearn.tree import DecisionTreeRegressorfrom sklearn.tree import DecisionTreeClassifier# HYPER PARAMETER SEARCH FOR OPTIMAL NUMBER OF NEIGHBORS hyper_param=[]train_error=[]test_error=[]# LOOP OVER HYPER-PARAMfor i inrange(1,40):# INITIALIZE MODEL model = DecisionTreeRegressor(max_depth=i)# TRAIN MODEL model.fit(x_train,y_train)# OUTPUT PREDICTIONS FOR TRAINING AND TEST SET yp_train = model.predict(x_train) yp_test = model.predict(x_test)# shift=1+np.min(y_train) #add shift to remove division by zero err1=mean_absolute_error(y_train, yp_train) err2=mean_absolute_error(y_test, yp_test) # err1=100.0*np.mean(np.absolute((yp_train-y_train)/y_train))# err2=100.0*np.mean(np.absolute((yp_test-y_test)/y_test)) hyper_param.append(i) train_error.append(err1) test_error.append(err2)if(i==1or i%10==0):print("hyperparam =",i)print(" train error:",err1)print(" test error:" ,err2)```- For `max_depth` = 1, both the training and test errors are relatively low, with values around 0.03. This suggests that the decision tree is fitting the data well.- As the `max_depth` increase to 10 and beyond, the training erros starts to drop, however, the test error starts to go up.- Interestingly, as `max_depth` furthur increase to 20 and 30, the training error remains very low (close to 0), but the test error starts to increase slightly. This is a sign of overfitting. The model is capturing the data too well, we don't want this to happen for generalization purposes.- Overall, the results show that a `max_depth` of around 1 to 10 appears to be a good choice because it achieves low test error without overfitting the training data.- **Convergence plot**```{python}plt.plot(hyper_param,train_error ,linewidth=2, color='k')plt.plot(hyper_param,test_error ,linewidth=2, color='b')plt.xlabel("Depth of tree (max depth)")plt.ylabel("Training (black) and test (blue) MAE (error)")i=1print(hyper_param[i],train_error[i],test_error[i])```- The convergence plot clearly demonstrates that as the `max_depth` hyperparameter increases, The point at which the lines start to diverge significantly is critical; it marks the transition from underfitting to the optimal complexity and then to overfitting. According to the plot a `max_depth` of 0 to 5 is optimal for a balance between training error and testing error.### Hyper-Parameter tuning for Decision Tree Regression (min_samples_splitint)```{python}# HYPER PARAMETER SEARCH FOR OPTIMAL NUMBER OF NEIGHBORS hyper_param=[]train_error=[]test_error=[]# LOOP OVER HYPER-PARAMfor i inrange(2,100):# INITIALIZE MODEL model = DecisionTreeRegressor(min_samples_split=i)# TRAIN MODEL model.fit(x_train,y_train)# OUTPUT PREDICTIONS FOR TRAINING AND TEST SET yp_train = model.predict(x_train) yp_test = model.predict(x_test)# shift=1+np.min(y_train) #add shift to remove division by zero err1=mean_absolute_error(y_train, yp_train) err2=mean_absolute_error(y_test, yp_test) # err1=100.0*np.mean(np.absolute((yp_train-y_train)/y_train))# err2=100.0*np.mean(np.absolute((yp_test-y_test)/y_test)) hyper_param.append(i) train_error.append(err1) test_error.append(err2)if(i%10==0):print("hyperparam =",i)print(" train error:",err1)print(" test error:" ,err2)```- The `min_samples_split` parameter controls the minimum number of samples required to split an internal node during the construction of the decision tree.- As `min_samples_split` increases from 10 to 20, both the training and test errors generally decrease. This suggests that when internal nodes require a larger number of samples to split.- Beyond `min_samples_split` = 20, the training error remains low, indicating that the model still fits the training data well, the test error drops at the same time.**Convergence Plot**```{python}plt.plot(hyper_param,train_error ,linewidth=2, color='k')plt.plot(hyper_param,test_error ,linewidth=2, color='b')plt.xlabel("Minimum number of points in split (min_samples_split)")plt.ylabel("Training (black) and test (blue) MAE (error)")```The convergence plot demonstrates that test error and train error converge to a point, roughly `min_samples_spliy = 50`.### Re-train with optimal parameters```{python}# INITIALIZE MODEL model = DecisionTreeRegressor(max_depth=1,min_samples_split=50)model.fit(x_train,y_train) # TRAIN MODEL # OUTPUT PREDICTIONS FOR TRAINING AND TEST SET yp_train = model.predict(x_train)yp_test = model.predict(x_test)err1=mean_absolute_error(y_train, yp_train) err2=mean_absolute_error(y_test, yp_test) print(" train error:",err1)print(" test error:" ,err2)```### Parity Plot- Plotting y_pred vs y_data lets you see how good the fit is- The closer to the line y=x the better the fit (ypred=ydata –\> prefect fit)```{python}plt.plot(y_train,yp_train ,"o", color='k')plt.plot(y_test,yp_test ,"o", color='b')plt.plot(y_train,y_train ,"-", color='r')plt.xlabel("y_data")plt.ylabel("y_pred (blue=test)(black=Train)")```### Plot Tree```{python}from sklearn import treedef plot_tree(model): fig = plt.figure(figsize=(25,20)) _ = tree.plot_tree(model, filled=True) plt.show()plot_tree(model)```## Conclusions(The final reaults are discussed with each section)- The decision tree with regression analysis gave us insights into heart disease risks, which performed quite well when looking back at the data we trained it on. However, when faced with new, unseen data, the model's predictions were inconsistent. This suggests that while our model has learned well from past data, it needs to be better tuned to handle new information effectively.- To improve, we aim to adjust our model's complexity. When the model is too simple, and we miss important patterns, when the model is too complex, and we face the risk if overfitting instead of generalization. The goal is to build a model that not only learns from the past but is also ptovide usefull insight for generalization.:::